tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167
Draft
phongn wants to merge 1 commit into
Draft
tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167phongn wants to merge 1 commit into
phongn wants to merge 1 commit into
Conversation
The bulk ASCII tolower loop used to canonicalize the scheme and host portions of a URL before hashing into the cache key runs at ~1.5 GB/s scalar (one byte and one ParseRules table lookup per iteration). The work is trivially data-parallel and there is no per-byte branching, so a SIMD kernel that lowercases a whole register at once gives a straightforward speedup once the input is long enough to amortize the vector setup. Add a header-only helper ts::memcpy_tolower under include/tscore/ink_memcpy_tolower.h with a compile-time-selected cascade of SIMD bodies: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Wider bodies fall through to narrower drain loops, so the worst-case scalar tail is always <16 bytes. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. The AVX-512BW body uses _mm512_mask_add_epi8 to fuse the conditional "+0x20 where upper" into a single op, and a masked load/store handles 1..63 leftover bytes in a single SIMD pass (inspired by Tony Finch's copytolower64.c, https://dotat.at/cgi/git/vectolower.git/). The whole AVX-512BW block is gated at n >= 64 because the masked load/store has ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes we fall through to the AVX2 + SSE2 cascade. Semantics match the existing ParseRules::ink_tolower table exactly: bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF) pass through unchanged. Replace the static inline memcpy_tolower in src/proxy/hdrs/URL.cc with this helper. Baseline x86_64 builds use the 16-byte SSE2 path; builds that opt into a wider -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW) get the wider bodies automatically. Sub-16-byte inputs (e.g. short HTTP schemes like "http") use the scalar tail and see no perf change. Measured throughput on a 2.0 GHz Ice Lake Xeon Gold 6338, mean ns: size scalar SSE2 AVX2 AVX-512BW ---- ------ ---- ---- --------- 16 B 10.4 2.15 1.75 1.98 32 B 15.4 2.90 2.24 2.31 64 B 28.0 4.43 2.85 2.61 256 B 113 13.87 7.57 6.20 1024 B 425 50.47 24.23 17.49 Speedup vs scalar at 1024 B: SSE2 8.4x, AVX2 17.5x, AVX-512BW 24.3x. A new microbenchmark under tools/benchmark covers correctness across sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte- value sweep that guards against any future widening of the case-fold range, alongside scalar-vs-SIMD throughput numbers and a config-print case that emits the selected ISA path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
|
@phongn Should we do this for in place too? |
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds a header-only SIMD ASCII lowercase-copy helper in tscore and switches the URL cache-key fast path to use it instead of the local scalar loop.
Changes:
- Adds
ts::memcpy_tolowerwith scalar, SSE2, AVX2, AVX-512BW, and NEON paths. - Replaces URL fast-path scheme/host lowercasing with the shared helper.
- Adds an optional Catch2 benchmark/correctness harness for the helper.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
include/tscore/ink_memcpy_tolower.h |
Defines the new SIMD/scalar lowercase-copy helper. |
src/proxy/hdrs/URL.cc |
Uses ts::memcpy_tolower in cache-key fast-path canonicalization. |
tools/benchmark/benchmark_memcpy_tolower.cc |
Adds benchmark and correctness checks for the helper. |
tools/benchmark/CMakeLists.txt |
Builds the new benchmark target when benchmarks are enabled. |
Comment on lines
+59
to
+60
| inline void | ||
| memcpy_tolower(char *dst, const char *src, std::size_t n) noexcept |
Comment on lines
+19
to
+24
| Implementation note: the bodies are stacked widest-first and each | ||
| drains its block size before falling through to the next. A build | ||
| with AVX-512BW gets the 64-byte body as the main loop, then at most | ||
| one 32-byte AVX2 iteration and one 16-byte SSE2 iteration to drain | ||
| the remainder before the scalar tail handles 0-15 bytes. Builds | ||
| without the wider ISAs simply skip those blocks. Selection is purely |
Comment on lines
+157
to
+164
| return output_scalar[0]; | ||
| }; | ||
|
|
||
| std::string simd_name = "ts::mct " + std::to_string(sz) + "B"; | ||
| BENCHMARK(simd_name.c_str()) | ||
| { | ||
| ts::memcpy_tolower(output_simd.data(), input.data(), sz); | ||
| return output_simd[0]; |
Comment on lines
+163
to
+164
| ts::memcpy_tolower(output_simd.data(), input.data(), sz); | ||
| return output_simd[0]; |
masaori335
reviewed
May 19, 2026
| // the speedup that path already provides. | ||
| // | ||
| // Inspired by Tony Finch's copytolower64.c | ||
| // (https://dotat.at/cgi/git/vectolower.git/). |
Contributor
There was a problem hiding this comment.
The referenced copytolower64.c seems to be under a BSD/MIT-style license.
It's worth to mention it in our NOTICE file. https://github.com/apache/trafficserver/blob/master/NOTICE
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add a SIMD-accelerated bulk ASCII tolower helper
ts::memcpy_tolowerintscore, and use it in place of the byte-at-a-time loop on the URL canonicalization fast path that produces the cache-key digest. Header-only helper with a compile-time ISA cascade: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. Operators get the wider path automatically by raising-march(x86-64-v3= AVX2,x86-64-v4= AVX-512BW); a stock x86_64 build keeps SSE2.Behavior matches
ParseRules::ink_tolowerexactly: bytes inA..Zmap toa..z, all others (including0x80..0xFF) pass through unchanged.Implementation notes
_mm512_mask_add_epi8to fuse the conditional+0x20into a single op, and a masked load/store for the 1–63-byte tail in one SIMD pass. Inspired by Tony Finch's copytolower64.c.src/proxy/hdrs/URL.ccdrops its static-inlinememcpy_tolowerand callsts::memcpy_tolowerinstead.Performance — measured on Xeon Gold 6338 (Ice Lake, 2.0 GHz)
Mean ns per call from
tools/benchmark/benchmark_memcpy_tolower:-mavx2)-mavx512bw)Speedup vs scalar at 1024 B: SSE2 8.4×, AVX2 17.5×, AVX-512BW 24.3×.
URL hot path inputs: HTTP schemes ("http"/"https") are 4–5 bytes and stay on the scalar tail with no change. Typical host names (16+ bytes) get the full 4–14× speedup depending on build flags.
Test plan
tools/benchmark/benchmark_memcpy_tolowerruns 269 correctness assertions covering:A..Zare remapped — guards against any future widening of the case-fold range.-mavx2-mavx2,-mavx512bw, and the defaultcmake --build build -t formatclean.src/proxy/hdrs/libhdrs.abuilds clean with the updated URL.cc.Notes for reviewers
-march=x86-64-v3or-march=x86-64-v4.<immintrin.h>/<arm_neon.h>only inside the#ifthat needs them, so other architectures don't pull them in.HPACK.cc,QPACK.cc,UrlRewrite.cc) could also benefit; left untouched here to keep this PR focused.🤖 Generated with Claude Code